Abstract

We study decision making in environments 
where the reward is only partially observed, but 
can be modeled as a function of an action and an 
observed context. This setting, known as con- 
textual bandits, encompasses a wide variety of 
applications including health-care policy and In- 
ternet advertising. A central task is evaluation 
of a new policy given historic data consisting of 
contexts, actions and received rewards. The key 
challenge is that the past data typically does not 
faithfully represent proportions of actions taken 
by a new policy. Previous approaches rely ei- 
ther on models of rewards or models of the past 
policy. The former are plagued by a large bias 
whereas the latter have a large variance. 

In this work, we leverage the strength and over- 
come the weaknesses of the two approaches by 
applying the doubly robust technique to the prob- 
lems of policy evaluation and optimization. We 
prove that this approach yields accurate value es- 
timates when we have either a good (but not nec- 
essarily consistent) model of rewards or a good 
(but not necessarily consistent) model of past 
policy. Extensive empirical comparison demon- 
strates that the doubly robust approach uniformly 
improves over existing techniques, achieving 
both lower variance in value estimation and bet- 
ter policies. As such, we expect the doubly robust 
approach to become common practice. 



1. Introduction 

We study decision making in environments where we receive
feedback only for chosen actions. For example, in Internet
advertising, we find only whether a user clicked on some of the
presented ads, but receive no information about the ads that
were not presented. In health care, we only find out success
rates for patients who received the treatments, but not for the
alternatives. Both of these problems are instances of contextual
bandits (Auer et al., 2002; Langford & Zhang, 2008). The context
refers to additional information about the user or patient. Here,
we focus on the offline version: we assume access to historic
data, but no ability to gather new data (Langford et al., 2008;
Strehl et al., 2011).



Two kinds of approaches address offline learning in con- 
textual bandits. The first, which we call the direct method 
(DM), estimates the reward function from given data and 
uses this estimate in place of actual reward to evaluate the 
policy value on a set of contexts. The second kind, called
inverse propensity score (IPS) (Horvitz & Thompson, 1952),
uses importance weighting to correct for the incorrect
proportions of actions in the historic data. The first
approach requires an accurate model of rewards, whereas 
the second approach requires an accurate model of the past 
policy. In general, it might be difficult to accurately model 
rewards, so the first assumption can be too restrictive. On 
the other hand, it is usually possible to model the past pol- 
icy quite well. However, the second kind of approach often 
suffers from large variance especially when the past policy 
differs significantly from the policy being evaluated. 

In this paper, we propose to use the technique of doubly
robust (DR) estimation to overcome problems with the two
existing approaches. Doubly robust (or doubly protected)
estimation (Cassel et al., 1976; Robins et al., 1994;
Robins & Rotnitzky, 1995; Lunceford & Davidian, 2004;
Kang & Schafer, 2007) is a statistical approach for estimation
from incomplete data with an important property: if either one
of the two estimators (in DM and IPS) is correct, then the
estimation is unbiased. This method thus increases the chances
of drawing reliable inference.

For example, when conducting a survey, seemingly ancil- 
lary questions such as age, sex, and family income may be 
asked. Since not everyone contacted responds to the survey,
these values along with census statistics may be used
to form an estimator of the probability of a response condi- 
tioned on age, sex, and family income. Using importance 
weighting inverse to these estimated probabilities, one esti- 
mator of overall opinions can be formed. An alternative es- 
timator can be formed by directly regressing to predict the 
survey outcome given any available sources of information. 
Doubly robust estimation unifies these two techniques, so 
that unbiasedness is guaranteed if either the probability es- 
timate is accurate or the regressed predictor is accurate. 

We apply the doubly robust technique to policy value estimation
in a contextual bandit setting. The core technique is analyzed
in terms of bias in Section 3 and variance in Section 4. Unlike
previous theoretical analyses, we do not assume that either the
reward model or the past policy model is correct. Instead, we
show how the deviations of the two models from the truth impact
bias and variance of the doubly robust estimator. To our
knowledge, this style of analysis is novel and may provide
insights into doubly robust estimation beyond the specific
setting studied here. In Section 5 we apply this method to both
policy evaluation and optimization, finding that this approach
substantially sharpens existing techniques.

1.1. Prior Work 

Doubly robust estimation is widely used in statistical
inference (see, e.g., Kang & Schafer (2007) and the references
therein). More recently, it has been used in Internet
advertising to estimate the effects of new features for online
advertisers (Lambert & Pregibon, 2007; Chan et al., 2010).
Previous work focuses on parameter estimation rather than policy
evaluation/optimization, as addressed here. Furthermore, most
previous analysis of doubly robust estimation studies asymptotic
behavior or relies on various modeling assumptions (e.g.,
Robins et al. (1994), Lunceford & Davidian (2004), and
Kang & Schafer (2007)). Our analysis is non-asymptotic and makes
no such assumptions.

Several other papers in machine learning have used ideas related
to the basic technique discussed here, although not with the
same language. For benign bandits, Hazan & Kale (2009) construct
algorithms which use reward estimators in order to achieve a
worst-case regret that depends on the variance of the bandit
rather than time. Similarly, the Offset Tree algorithm
(Beygelzimer & Langford, 2009) can be thought of as using a
crude reward estimate for the "offset". In both cases, the
algorithms and estimators described here are substantially more
sophisticated.

2. Problem Definition and Approach 

Let $X$ be an input space and $A = \{1, \ldots, k\}$ a finite
action space. A contextual bandit problem is specified by a
distribution $D$ over pairs $(x, r)$ where $x \in X$ is the
context and $r \in [0, 1]^A$ is a vector of rewards. The input
data has been generated using some unknown policy (possibly
adaptive and randomized) as follows:

• The world draws a new example $(x, r) \sim D$. Only $x$ is
revealed.

• The policy chooses an action $a \sim p(a \mid x, h)$, where
$h$ is the history of previous observations (that is, the
concatenation of all preceding contexts, actions and observed
rewards).

• Reward $r_a$ is revealed. It should be emphasized that other
rewards $r_{a'}$ with $a' \neq a$ are not observed.

Note that neither the distribution $D$ nor the policy $p$ is
known. Given a data set $S = \{(x, h, a, r_a)\}$ collected as
above, we are interested in two tasks: policy evaluation and
policy optimization. In policy evaluation, we are interested in
estimating the value of a stationary policy $\pi$, defined as:

$$V^\pi = \mathbf{E}_{(x,r)\sim D}\left[r_{\pi(x)}\right].$$

On the other hand, the goal of policy optimization is to find an
optimal policy with maximum value:
$\pi^* = \operatorname{argmax}_\pi V^\pi$. In the theoretical
sections of the paper, we treat the problem of policy
evaluation. It is expected that better evaluation generally
leads to better optimization (Strehl et al., 2011). In the
experimental section, we study how our policy evaluation
approach can be used for policy optimization in a classification
setting.

2.1. Existing Approaches 

The key challenge in estimating policy value, given the data 
as described in the previous section, is the fact that we only 
have partial information about the reward, hence we can- 
not directly simulate our proposed policy on the data set 
S. There are two common solutions for overcoming this 
limitation. The first, called the direct method (DM), forms an
estimate $\hat\varrho_a(x)$ of the expected reward conditioned
on the context and action. The policy value is then estimated by

$$\hat V^\pi_{\mathrm{DM}} = \frac{1}{|S|} \sum_{x \in S} \hat\varrho_{\pi(x)}(x).$$

Clearly, if $\hat\varrho_a(x)$ is a good approximation of the
true expected reward, defined as
$\varrho_a(x) = \mathbf{E}_{(x,r)\sim D}[r_a \mid x]$, then the
DM estimate is close to $V^\pi$. Also, if $\hat\varrho$ is
unbiased, $\hat V^\pi_{\mathrm{DM}}$ is an unbiased estimate of
$V^\pi$. A problem with this method is that the estimate
$\hat\varrho$ is formed without the knowledge of $\pi$ and hence
might focus on approximating $\varrho$ mainly in the areas that
are irrelevant for $V^\pi$ and not sufficiently in the areas
that are important for $V^\pi$; see Beygelzimer & Langford
(2009) for a more refined analysis.

The second approach, called inverse propensity score (IPS), is
typically less prone to problems with bias. Instead of
approximating the reward, IPS forms an approximation
$\hat p(a \mid x, h)$ of $p(a \mid x, h)$, and uses this
estimate to correct for the shift in action proportions between
the old, data-collection policy and the new policy:

$$\hat V^\pi_{\mathrm{IPS}} = \frac{1}{|S|} \sum_{(x,h,a,r_a) \in S} \frac{r_a\, I(\pi(x) = a)}{\hat p(a \mid x, h)},$$

where $I(\cdot)$ is an indicator function evaluating to one if
its argument is true and zero otherwise. If
$\hat p(a \mid x, h) \approx p(a \mid x, h)$ then the IPS
estimate above will be, approximately, an unbiased estimate of
$V^\pi$. Since we typically have a good (or even accurate)
understanding of the data-collection policy, it is often easier
to obtain a good estimate $\hat p$, and thus the IPS estimator
is in practice less susceptible to problems with bias compared
with the direct method. However, IPS typically has a much larger
variance, due to the range of the random variable increasing.
The issue becomes more severe when $p(a \mid x, h)$ gets
smaller. Our approach alleviates the large variance problem of
IPS by taking advantage of the estimate $\hat\varrho$ used by
the direct method.
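
As a rough illustration of the two existing estimators, the
sketch below implements DM and IPS for a stationary logging
policy, so the history h is dropped. The function names, the
`(x, a, r)` layout of the log, and the callables `reward_model`
and `propensity` are assumptions made for this example only.

```python
def dm_estimate(contexts, policy, reward_model):
    """Direct method: average the modeled reward of the action the policy picks."""
    return sum(reward_model(x, policy(x)) for x in contexts) / len(contexts)


def ips_estimate(logged, policy, propensity):
    """Inverse propensity score estimate from logged (x, a, r) tuples.

    propensity(a, x) is the estimated probability that the logging
    policy chose action a in context x (assumed stationary here).
    """
    total = 0.0
    for x, a, r in logged:
        if policy(x) == a:  # indicator I(pi(x) = a)
            total += r / propensity(a, x)
    return total / len(logged)


# Toy log: two contexts, two actions, uniform logging policy.
logged = [(0, 1, 1.0), (0, 0, 0.0), (1, 1, 0.0), (1, 0, 1.0)]
always_one = lambda x: 1
dm_value = dm_estimate([x for x, _, _ in logged], always_one, lambda x, a: 0.25)
ips_value = ips_estimate(logged, always_one, lambda a, x: 0.5)
```

Only two of the four logged examples agree with the evaluated
policy, so IPS uses half of the data, each weighted by 1/0.5.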

2.2. Doubly Robust Estimator 

Doubly robust estimators take advantage of both the estimate of
the expected reward $\hat\varrho_a(x)$ and the estimate of
action probabilities $\hat p(a \mid x, h)$. Here, we use a DR
estimator of the form first suggested by Cassel et al. (1976)
for regression, but previously not studied for policy learning:

$$\hat V^\pi_{\mathrm{DR}} = \frac{1}{|S|} \sum_{(x,h,a,r_a) \in S} \left[ \frac{(r_a - \hat\varrho_a(x))\, I(\pi(x) = a)}{\hat p(a \mid x, h)} + \hat\varrho_{\pi(x)}(x) \right]. \qquad (1)$$

Informally, the estimator uses $\hat\varrho$ as a baseline and
if there is data available, a correction is applied. We will see
that our estimator is accurate if at least one of the
estimators, $\hat\varrho$ and $\hat p$, is accurate, hence the
name doubly robust.
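
A minimal sketch of the DR estimator follows, again assuming a
stationary logging policy. The names `dr_estimate`,
`reward_model` and `propensity`, and the `(x, a, r)` log layout,
are hypothetical choices for this example.

```python
def dr_estimate(logged, policy, reward_model, propensity):
    """Doubly robust estimate in the form of Eq. (1), stationary case."""
    total = 0.0
    for x, a, r in logged:
        chosen = policy(x)
        term = reward_model(x, chosen)  # baseline from the reward model
        if chosen == a:  # correction only where the log agrees with pi
            term += (r - reward_model(x, a)) / propensity(a, x)
        total += term
    return total / len(logged)


logged = [(0, 1, 1.0), (0, 0, 0.0), (1, 1, 0.0), (1, 0, 1.0)]
always_one = lambda x: 1
uniform = lambda a, x: 0.5
dr_with_model = dr_estimate(logged, always_one, lambda x, a: 0.25, uniform)
dr_zero_model = dr_estimate(logged, always_one, lambda x, a: 0.0, uniform)
```

With a reward model that is identically zero the baseline
vanishes and the estimate coincides with IPS on the same log.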

In practice, it is rare to have an accurate estimation of either
$\varrho$ or $p$. Thus, a basic question is: How does this
estimator perform as the estimates $\hat\varrho$ and $\hat p$
deviate from the truth? The following two sections are dedicated
to bias and variance analysis, respectively, of the DR
estimator.

3. Bias Analysis 

Let $\Delta$ denote the additive deviation of $\hat\varrho$ from
$\varrho$, and $\delta$ a multiplicative deviation of $\hat p$
from $p$:

$$\Delta(a, x) = \hat\varrho_a(x) - \varrho_a(x), \qquad \delta(a, x, h) = 1 - p(a \mid x, h) / \hat p(a \mid x, h).$$

We express the expected value of $\hat V^\pi_{\mathrm{DR}}$
using $\delta(\cdot, \cdot, \cdot)$ and $\Delta(\cdot, \cdot)$.
To remove clutter, we introduce shorthands $\varrho_a$ for
$\varrho_a(x)$, $\hat\varrho_a$ for $\hat\varrho_a(x)$, $I$ for
$I(\pi(x) = a)$, $p$ for $p(\pi(x) \mid x, h)$, $\hat p$ for
$\hat p(\pi(x) \mid x, h)$, $\Delta$ for $\Delta(\pi(x), x)$,
and $\delta$ for $\delta(\pi(x), x, h)$. In our analysis, we
assume that the estimates $\hat p$ and $\hat\varrho$ are fixed
independently of $S$ (e.g., by splitting the original data set
into $S$ and a separate portion for estimating $\hat p$ and
$\hat\varrho$). To evaluate
$\mathbf{E}[\hat V^\pi_{\mathrm{DR}}]$, it suffices to focus on
a single term in Eq. (1), conditioning on $h$:



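$$
\begin{aligned}
\mathbf{E}_{(x,r)\sim D,\, a\sim p(\cdot \mid x,h)} \left[ \frac{(r_a - \hat\varrho_a) I}{\hat p} + \hat\varrho_{\pi(x)} \right]
&= \mathbf{E}_{x,r,a \mid h} \left[ \frac{(r_a - \varrho_a - \Delta) I}{\hat p} + \varrho_{\pi(x)} + \Delta \right] \qquad (2) \\
&= \mathbf{E}_{x,a \mid h} \left[ \frac{(\varrho_a - \varrho_a) I}{\hat p} + \Delta (1 - I/\hat p) \right] + \mathbf{E}_x \left[ \varrho_{\pi(x)} \right] \\
&= \mathbf{E}_{x \mid h} \left[ \Delta (1 - p/\hat p) \right] + V^\pi
 = \mathbf{E}_{x \mid h} [\Delta\delta] + V^\pi .
\end{aligned}
$$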



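Even though $x$ is independent of $h$, the conditioning on $h$
remains in the last line, because $\delta$, $p$ and $\hat p$ are
functions of $h$. Summing across all terms in Eq. (1), we obtain
the following theorem:

Theorem 1 Let $\Delta$ and $\delta$ be defined as above. Then,
the bias of the doubly robust estimator is

$$\left| \mathbf{E}_S[\hat V^\pi_{\mathrm{DR}}] - V^\pi \right| = \frac{1}{|S|} \left| \mathbf{E}_S \left[ \sum_{(x,h) \in S} \Delta\delta \right] \right| .$$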

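If the past policy and the past policy estimate are stationary
(i.e., independent of $h$), the expression simplifies to

$$\left| \mathbf{E}[\hat V^\pi_{\mathrm{DR}}] - V^\pi \right| = \left| \mathbf{E}_x[\Delta\delta] \right| .$$

In contrast (for simplicity we assume stationarity):

$$\left| \mathbf{E}[\hat V^\pi_{\mathrm{DM}}] - V^\pi \right| = \left| \mathbf{E}_x[\Delta] \right| , \qquad \left| \mathbf{E}[\hat V^\pi_{\mathrm{IPS}}] - V^\pi \right| = \left| \mathbf{E}_x[\varrho_{\pi(x)} \delta] \right| ,$$

where the second equality is based on the observation that IPS
is a special case of DR for $\hat\varrho_a(x) \equiv 0$.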

In general, neither of the estimators dominates the others.
However, if either $\Delta \approx 0$, or $\delta \approx 0$,
the expected value of the doubly robust estimator will be close
to the true value, whereas DM requires $\Delta \approx 0$ and
IPS requires $\delta \approx 0$. Also, if
$\Delta \not\approx 0$ and $\delta \ll 1$, DR will still
outperform DM, and similarly for IPS with roles of $\Delta$ and
$\delta$ reversed. Thus, DR can effectively take advantage of
both sources of information for better estimation.
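
The bias expressions can be checked numerically on a small
example. The sketch below uses a made-up problem with two
equally likely contexts and two actions (all numbers are
illustrative assumptions), computes the exact expectation of one
DR summand, and compares the resulting bias with the predicted
value of E[Δδ]. Setting the reward model to zero gives the IPS
case.

```python
# Toy problem: two equally likely contexts, two actions, stationary policies.
# All dictionaries are keyed by (x, a); the numbers are illustrative only.
rho = {(0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.6, (1, 1): 0.4}      # true E[r_a | x]
rho_hat = {(0, 0): 0.3, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.5}  # reward model
p = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.8, (1, 1): 0.2}        # logging policy
p_hat = {(0, 0): 0.4, (0, 1): 0.6, (1, 0): 0.75, (1, 1): 0.25}  # its estimate
pi = {0: 1, 1: 0}                                                # evaluated policy


def expected_dr_term(model):
    """Exact expectation of one DR summand for x uniform and a ~ p(. | x)."""
    total = 0.0
    for x in (0, 1):
        for a in (0, 1):
            ind = 1.0 if pi[x] == a else 0.0
            # E[r_a | x, a] = rho, so rho stands in for the random reward.
            term = (rho[x, a] - model[x, a]) * ind / p_hat[x, a] + model[x, pi[x]]
            total += 0.5 * p[x, a] * term
    return total


true_value = sum(0.5 * rho[x, pi[x]] for x in (0, 1))
big_delta = {x: rho_hat[x, pi[x]] - rho[x, pi[x]] for x in (0, 1)}        # Delta
small_delta = {x: 1.0 - p[x, pi[x]] / p_hat[x, pi[x]] for x in (0, 1)}    # delta

dr_bias = expected_dr_term(rho_hat) - true_value
ips_bias = expected_dr_term({key: 0.0 for key in rho}) - true_value
dm_bias = sum(0.5 * big_delta[x] for x in (0, 1))
predicted_dr = sum(0.5 * big_delta[x] * small_delta[x] for x in (0, 1))
predicted_ips = -sum(0.5 * rho[x, pi[x]] * small_delta[x] for x in (0, 1))
```

In this example both models are mildly wrong, yet the DR bias is
an order of magnitude smaller than the DM bias because each Δ is
multiplied by a small δ.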

4. Variance Analysis 

In the previous section, we argued that the expected value of
$\hat V^\pi_{\mathrm{DR}}$ compares favorably with IPS and DM.
In this section, we look at the variance of DR. Since large
deviation bounds have a primary dependence on variance, a lower
variance implies a faster convergence rate. We treat only the
case with stationary past policy, and hence drop the dependence
on $h$ throughout.

As in the previous section, it suffices to analyze the second
moment (and then variance) of a single term of Eq. (1). We use a
similar decomposition as in Eq. (2). To simplify derivation we
use the notation $\epsilon = (r_a - \varrho_a) I / \hat p$. Note
that, conditioned on $x$ and $a$, the expectation of $\epsilon$
is zero. Hence, we can write the second moment as

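$$
\begin{aligned}
\mathbf{E}_{x,r,a} \left[ \left( \frac{(r_a - \hat\varrho_a) I}{\hat p} + \hat\varrho_{\pi(x)} \right)^2 \right]
&= \mathbf{E}_{x,r,a}[\epsilon^2] + \mathbf{E}_x[\varrho_{\pi(x)}^2] + 2 \mathbf{E}_{x,a}[\varrho_{\pi(x)} \Delta (1 - I/\hat p)] + \mathbf{E}_{x,a}[\Delta^2 (1 - I/\hat p)^2] \\
&= \mathbf{E}_{x,r,a}[\epsilon^2] + \mathbf{E}_x[\varrho_{\pi(x)}^2] + 2 \mathbf{E}_x[\varrho_{\pi(x)} \Delta\delta] + \mathbf{E}_x[\Delta^2 (1 - 2p/\hat p + p/\hat p^2)] \\
&= \mathbf{E}_{x,r,a}[\epsilon^2] + \mathbf{E}_x[\varrho_{\pi(x)}^2] + 2 \mathbf{E}_x[\varrho_{\pi(x)} \Delta\delta] + \mathbf{E}_x[\Delta^2 (1 - 2p/\hat p + p^2/\hat p^2 + p(1-p)/\hat p^2)] \\
&= \mathbf{E}_{x,r,a}[\epsilon^2] + \mathbf{E}_x[(\varrho_{\pi(x)} + \Delta\delta)^2] + \mathbf{E}_x[\Delta^2 \cdot p(1-p)/\hat p^2] \\
&= \mathbf{E}_{x,r,a}[\epsilon^2] + \mathbf{E}_x[(\varrho_{\pi(x)} + \Delta\delta)^2] + \mathbf{E}_x \left[ \frac{1-p}{p} \cdot \Delta^2 (1-\delta)^2 \right] .
\end{aligned}
$$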

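Summing across all terms in Eq. (1) and combining with
Theorem 1, we obtain the variance:

Theorem 2 Let $\Delta$, $\delta$ and $\epsilon$ be defined as
above. If the past policy and the policy estimate are
stationary, then the variance of the doubly robust estimator is

$$\mathrm{Var}[\hat V^\pi_{\mathrm{DR}}] = \frac{1}{|S|} \left( \mathbf{E}_{x,r,a}[\epsilon^2] + \mathrm{Var}_x[\varrho_{\pi(x)} + \Delta\delta] + \mathbf{E}_x \left[ \frac{1-p}{p} \cdot \Delta^2 (1-\delta)^2 \right] \right) .$$
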
Thus, the variance can be decomposed into three terms. 
The first accounts for randomness in rewards. The second 
term is the variance of the estimator due to the randomness 
in x. And the last term can be viewed as the importance 
weighting penalty. A similar expression can be derived for 
the IPS estimator: 



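$$\mathrm{Var}[\hat V^\pi_{\mathrm{IPS}}] = \frac{1}{|S|} \left( \mathbf{E}_{x,r,a}[\epsilon^2] + \mathrm{Var}_x[\varrho_{\pi(x)} - \varrho_{\pi(x)} \delta] + \mathbf{E}_x \left[ \frac{1-p}{p} \cdot \varrho_{\pi(x)}^2 (1-\delta)^2 \right] \right) .$$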



The first term is identical; the second term will be of similar
magnitude as the corresponding term of the DR estimator,
provided that $\delta \approx 0$. However, the third term can be
much larger for IPS if $p(\pi(x) \mid x) \ll 1$ and $|\Delta|$
is smaller than $\varrho_{\pi(x)}$. In contrast, for the direct
method, we obtain

$$\mathrm{Var}[\hat V^\pi_{\mathrm{DM}}] = \frac{1}{|S|} \mathrm{Var}_x[\varrho_{\pi(x)} + \Delta] .$$

Thus, the variance of the direct method does not have terms 
depending either on the past policy or the randomness in 
the rewards. This fact usually suffices to ensure that it is 
significantly lower than the variance of DR or IPS. How- 
ever, as we mention in the previous section, the bias of the 
direct method is typically much larger, leading to larger er- 
rors in estimating policy value. 
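
The three-term decomposition of the DR variance can be sanity
checked by brute force. The sketch below assumes a toy problem
with Bernoulli rewards (all numbers are illustrative), enumerates
every outcome of one DR summand to get its exact variance, and
compares it with the sum of the three terms for |S| = 1.

```python
# Illustrative toy problem, keyed by (x, a); contexts 0 and 1 are equally likely.
rho = {(0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.6, (1, 1): 0.4}      # P(r_a = 1 | x)
rho_hat = {(0, 0): 0.3, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.5}  # reward model
p = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.8, (1, 1): 0.2}        # logging policy
p_hat = {(0, 0): 0.4, (0, 1): 0.6, (1, 0): 0.75, (1, 1): 0.25}  # its estimate
pi = {0: 1, 1: 0}                                                # evaluated policy

# Exact mean and second moment of one DR summand, by enumeration.
mean = second = 0.0
for x in (0, 1):
    for a in (0, 1):
        for r, prob_r in ((1.0, rho[x, a]), (0.0, 1.0 - rho[x, a])):
            ind = 1.0 if pi[x] == a else 0.0
            term = (r - rho_hat[x, a]) * ind / p_hat[x, a] + rho_hat[x, pi[x]]
            weight = 0.5 * p[x, a] * prob_r
            mean += weight * term
            second += weight * term ** 2
exact_var = second - mean ** 2

# The three terms of the decomposition, evaluated at a = pi(x).
noise = shift = shift_sq = penalty = 0.0
for x in (0, 1):
    a = pi[x]
    big = rho_hat[x, a] - rho[x, a]        # Delta
    small = 1.0 - p[x, a] / p_hat[x, a]    # delta
    noise += 0.5 * p[x, a] * rho[x, a] * (1.0 - rho[x, a]) / p_hat[x, a] ** 2
    shift += 0.5 * (rho[x, a] + big * small)
    shift_sq += 0.5 * (rho[x, a] + big * small) ** 2
    penalty += 0.5 * (1.0 - p[x, a]) / p[x, a] * big ** 2 * (1.0 - small) ** 2
formula_var = noise + (shift_sq - shift ** 2) + penalty
```

Here `noise` is the reward-randomness term, the bracketed
difference is the variance over contexts, and `penalty` is the
importance weighting term.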

5. Experiments 

This section provides empirical evidence for the effective- 
ness of the DR estimator compared to IPS and DM. We 
consider two classes of problems: multiclass classification 
with bandit feedback in public benchmark datasets and es- 
timation of average user visits to an Internet portal. 

5.1. Multiclass Classification with Bandit Feedback 

We begin with a description of how to turn a k-class
classification task into a k-armed contextual bandit problem.
This transformation allows us to compare IPS and DR using public
datasets for both policy evaluation and learning.

5.1.1. Data Setup 

In a classification task, we assume data are drawn IID from a
fixed distribution: $(x, c) \sim D$, where $x \in X$ is the
feature vector and $c \in \{1, 2, \ldots, k\}$ is the class
label. A typical goal is to find a classifier
$\pi : X \mapsto \{1, 2, \ldots, k\}$ minimizing the
classification error:
$e(\pi) = \mathbf{E}_{(x,c)\sim D}[I(\pi(x) \neq c)]$.

Alternatively, we may turn the data point $(x, c)$ into a
cost-sensitive classification example
$(x, l_1, l_2, \ldots, l_k)$, where $l_a = I(a \neq c)$ is the
loss for predicting $a$. Then, a classifier $\pi$ may be
interpreted as an action-selection policy, and its
classification error is exactly the policy's expected loss.¹

To construct a partially labeled dataset, exactly one loss
component for each example is observed, following the approach
of Beygelzimer & Langford (2009). Specifically, given any
$(x, l_1, l_2, \ldots, l_k)$, we randomly select a label
$a \sim \mathrm{UNIF}(1, 2, \ldots, k)$, and then only reveal
the component $l_a$. The final data are thus in the form of
$(x, a, l_a)$,



¹ When considering classification problems, it is more natural
to talk about minimizing classification errors. This loss
minimization problem is symmetric to the reward maximization
problem defined in Section 2.
Dataset        Classes (k)   Dataset size
ecoli          8             336
glass          6             214
letter         26            20000
optdigits      10            5620
page-blocks    5             5473
pendigits      10            10992
satimage       6             6435
vehicle        4             846
yeast          10            1484

Table 1. Characteristics of benchmark datasets used in Section 5.1.



which is the form of data defined in Section 2. Furthermore,
$p(a \mid x) = 1/k$ and is assumed to be known.
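
The transformation can be sketched in a few lines. The helper
name `to_bandit_data` and the toy examples are assumptions for
illustration; labels and actions are numbered 1 to k as in the
text.

```python
import random


def to_bandit_data(examples, k, rng):
    """Convert fully labeled (x, c) pairs into partially labeled (x, a, l_a) triples."""
    logged = []
    for x, c in examples:
        a = rng.randint(1, k)      # a ~ UNIF(1, ..., k), so p(a | x) = 1/k
        loss = 1 if a != c else 0  # l_a = I(a != c); the other losses stay hidden
        logged.append((x, a, loss))
    return logged


examples = [((0.1, 0.9), 1), ((0.7, 0.3), 2), ((0.5, 0.5), 3)]
bandit = to_bandit_data(examples, k=3, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the partial
labeling reproducible across repeated experiments.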

Table 1 summarizes the benchmark problems adopted from
the UCI repository (Asuncion & Newman, 2007).



5.1.2. Policy Evaluation 

Here, we investigate whether the DR technique indeed 
gives more accurate estimates of the policy value (or clas- 
sification error in our context). For each dataset: 



1. We randomly split data into training and test sets of
(roughly) the same size;

2. On the training set with fully revealed losses, we run a
direct loss minimization (DLM) algorithm of McAllester et al.
(2011) to obtain a classifier (see Appendix A for details).
This classifier constitutes the policy $\pi$ which we evaluate
on test data;

3. We compute the classification error on fully observed
test data. This error is treated as the ground truth for
comparing various estimates;

4. Finally, we apply the transformation in Section 5.1.1
to the test data to obtain a partially labeled set, from
which DM, IPS, and DR estimates are computed.

Both DM and DR require estimating the expected conditional loss
denoted as $l(x, a)$ for given $(x, a)$. We use a linear loss
model: $\hat l(x, a) = w_a \cdot x$, parameterized by $k$ weight
vectors $\{w_a\}_{a \in \{1, \ldots, k\}}$, and use
least-squares ridge regression to fit $w_a$ based on the
training set. Step 4 is repeated 500 times, and the resulting
bias and rmse (root mean squared error) are reported in Fig. 1.

As predicted by analysis, both IPS and DR are unbiased, 
since the probability estimate 1/k is accurate. In contrast, 
the linear loss model fails to capture the classification error 
accurately, and as a result, DM suffers a much larger bias. 

While IPS and DR estimators are unbiased, it is apparent 
from the rmse plot that the DR estimator enjoys a lower 
variance. As we shall see next, such an effect is substantial 
when it comes to policy optimization. 

5.1.3. Policy Optimization 

We now consider policy optimization (classifier learning).
Since DM is significantly worse on all datasets, as indicated
in Fig. 1, we focus on the comparison between IPS and DR.

Here, we apply the data transformation in Section 5.1.1 to



Figure 1. Bias (upper) and rmse (lower) of the three estimators
(IPS, DR, DM) for classification error. See Table 2 for precise
numbers.



the training data, and then learn a classifier based on the 
loss estimated by IPS and DR, respectively. Specifically, 
for each dataset, we repeat the following steps 30 times: 

1. We randomly split data into training (70%) and test 
(30%) sets; 

2. We apply the transformation in Section 5.1.1 to the
training data to obtain a partially labeled set;

3. We then use the IPS and DR estimators to impute
unrevealed losses in the training data;

4. Two cost-sensitive multiclass classification algorithms
are used to learn a classifier from the losses completed by
either IPS or DR: the first is DLM (McAllester et al., 2011),
the other is the Filter Tree reduction of Beygelzimer et al.
(2008) applied to a decision tree (see Appendix B for more
details);
Dataset        bias (IPS)  bias (DR)  bias (DM)  rmse (IPS)  rmse (DR)  rmse (DM)
ecoli          ?           0.002      0.129      0.137       0.101      0.129
glass          ?           0.001      0.147      0.194       0.142      0.147
letter         ?           0.001      0.213      0.049       0.03       0.213
optdigits      ?           ?          0.175      0.023       0.023      0.175
page-blocks    ?           ?          0.063      0.012       0.011      0.063
pendigits      ?           ?          0.208      0.015       0.016      0.208
satimage       ?           ?          0.174      0.021       0.019      0.174
vehicle        ?           0.001      0.281      0.062       0.058      0.281
yeast          ?           0.007      0.193      0.099       0.076      0.193

Table 2. Comparison of results in Figure 1.



Dataset        IPS (DLM)   DR (DLM)   IPS (FT)   DR (FT)    Offset Tree
ecoli          0.52933     0.28853    0.46563    0.32583    0.34007
glass          0.6738      0.50157    0.90783    0.45807    0.52843
letter         0.93015     0.60704    0.9393     0.47197    0.5837
optdigits      0.64403     0.09033    0.84017    0.17793    0.3251
page-blocks    0.08913     0.0831     0.3701     0.05283    0.04483
pendigits      0.5358      0.12663    0.73123    0.0956     0.15003
satimage       0.40223     0.17133    0.69313    0.18647    0.20957
vehicle        0.39507     0.31603    0.63517    0.38753    0.37847
yeast          0.72973     0.5292     0.81147    0.59053    0.5895

Table 3. Comparison of results in Figure 2.



5. Finally, we evaluate the learned classifiers on the test 
data to obtain classification error. 

Again, we use least-squares ridge regression to build a linear
loss estimator: $\hat l(x, a) = w_a \cdot x$. However, since the
training data is partially labeled, $w_a$ is fitted only using
training data $(x, a', l_{a'})$ for which $a = a'$.
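
One plausible reading of Step 3 is that each estimator fills in
a full loss vector for every training example, which the
cost-sensitive learner then consumes. The sketch below shows
that reading only; the exact imputation formula is not spelled
out in the text, the names `impute_losses` and `loss_model` are
hypothetical, and actions are 0-indexed for convenience.

```python
def impute_losses(x, a, loss, k, p, loss_model=None):
    """Return a length-k loss vector for one logged example (x, a, l_a).

    With loss_model=None this is an IPS-style imputation; otherwise a
    DR-style one. p is the probability with which action a was logged
    (1/k under the uniform scheme of Section 5.1.1).
    """
    imputed = []
    for action in range(k):
        value = loss_model(x, action) if loss_model is not None else 0.0
        if action == a:
            value += (loss - value) / p  # importance-weighted correction
        imputed.append(value)
    return imputed


ips_losses = impute_losses(None, a=1, loss=1.0, k=3, p=1.0 / 3.0)
dr_losses = impute_losses(None, a=1, loss=1.0, k=3, p=1.0 / 3.0,
                          loss_model=lambda x, action: 0.5)
```

Under this reading, IPS leaves every unobserved action at zero
loss, while DR falls back to the model's prediction, which makes
the imputed vectors far less extreme.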

Average classification errors (obtained in Step 5 above) of
the 30 runs are plotted in Fig. 2. Clearly, for policy
optimization, the advantage of the DR is even greater than for
policy evaluation. In all datasets, DR provides substantially
more reliable loss estimates than IPS, and results in
significantly improved classifiers.

Fig. 2 also includes classification error of the Offset Tree
reduction, which is designed specifically for policy
optimization with partially labeled data.² While the IPS
versions of DLM and Filter Tree are rather weak, the DR
versions are competitive with Offset Tree in all datasets, and
in some cases significantly outperform Offset Tree.

Finally, we note DR provided similar improvements to two 
very different algorithms, one based on gradient descent, 
the other based on tree induction. It suggests the generality 
of DR when combined with different algorithmic choices. 

5.2. Estimating Average User Visits 

The next problem we consider is estimating the average number
of user visits to a popular Internet portal. Real user visits
to the website were recorded for about 4 million bcookies³
randomly selected from all bcookies during March 2010. Each
bcookie is associated with a sparse binary feature vector of
size around 5000. These features describe browsing behavior as
well as other information (such as age, gender, and
geographical location) of the bcookie. We chose a fixed time
window in March 2010 and calculated the number of visits by
each selected bcookie during this window. To summarize, the
dataset contains $N = 3854689$ data:
$D = \{(b_i, x_i, v_i)\}_{i=1,\ldots,N}$, where $b_i$ is the
i-th (unique) bcookie, $x_i$ is the corresponding binary
feature vector, and $v_i$ is the number of visits.

If we can sample from $D$ uniformly at random, the sample mean
of $v_i$ will be an unbiased estimate of the true average
number of user visits, which is 23.8 in this problem. However,
in various situations, it may be difficult or impossible to
ensure a uniform sampling scheme due to practical constraints,
thus the sample mean may not reflect the true quantity of
interest. This is known as covariate shift, a special case of
our problem formulated in Section 2 with $k = 2$ arms.
Formally, the partially labeled data consists of tuples
$(x_i, a_i, r_i)$, where $a_i \in \{0, 1\}$ indicates whether
bcookie $b_i$ is sampled, $r_i = a_i v_i$ is the observed
number of visits, and $p_i$ is the probability that $a_i = 1$.
The goal here is to evaluate the value of a constant policy:
$\pi(x) \equiv 1$.

To define the sampling probabilities $p_i$, we adopted a
similar approach as in Gretton et al. (2008). In particular, we
obtained the first principal component (denoted $\bar x$) of
all features $\{x_i\}$, and projected all data onto $\bar x$.
Let $\mathcal{N}$ be a univariate normal distribution with mean
$m + (\bar m - m)/3$



² We used decision trees as the base learner in Offset Trees.
The numbers reported here are not identical to those by
Beygelzimer & Langford (2009) probably because the filter-tree
structures in our implementation were different.



³ A bcookie is a unique string that identifies a user. Strictly
speaking, one user may correspond to multiple bcookies, but it
suffices to equate a bcookie with a user for our purposes here.
Figure 2. Classification error of DLM (upper) and filter tree
(lower). Note that the representations used by DLM and the
trees differ radically, confounding any comparison between the
approaches. However, the Offset and Filter Tree approaches
share a similar representation, so differences in performance
are purely a matter of superior optimization. See Table 3 for
precise numbers.



and standard deviation $(\bar m - m)/4$, where $m$ and
$\bar m$ were the minimum and mean of the projected values.
Then, $p_i = \min\{\mathcal{N}(x_i \cdot \bar x), 1\}$ was the
sampling probability of the i-th bcookie, $b_i$.

To control data size, we randomly subsampled a fraction
$f \in \{0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05\}$ from the
entire dataset $D$. For each bcookie $b_i$ in this subsample,
set $a_i = 1$ with probability $p_i$, and $a_i = 0$ otherwise.
We then calculated the IPS and DR estimates on this subsample.
The whole process was repeated 100 times.

The DR estimator required building a reward model
$\hat\varrho(x)$, which, given feature $x$, predicted the
average number of visits. Again, least-squares ridge regression
was used to fit a linear model $\hat\varrho(x) = w \cdot x$
from sampled data.
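
A compact sketch of this setup follows. It assumes the projected
values are already available as a list, interprets N(·) as the
value of the normal density, and uses hypothetical function
names. The principal component step and the ridge regression are
left out.

```python
import math


def sampling_probabilities(projected):
    """p_i = min{N(z_i), 1}, with N read as the normal density.

    projected holds the projections x_i . x_bar; they must not all be
    equal, otherwise the standard deviation below is zero.
    """
    m = min(projected)
    m_bar = sum(projected) / len(projected)
    mu = m + (m_bar - m) / 3.0
    sigma = (m_bar - m) / 4.0

    def density(z):
        scaled = (z - mu) / sigma
        return math.exp(-0.5 * scaled ** 2) / (sigma * math.sqrt(2.0 * math.pi))

    return [min(density(z), 1.0) for z in projected]


def ips_mean(visits, sampled, probs):
    """IPS estimate of the average visit count for the policy pi(x) = 1."""
    return sum(v / p for v, a, p in zip(visits, sampled, probs) if a) / len(visits)


def dr_mean(visits, sampled, probs, predicted):
    """DR estimate: model prediction plus a weighted residual on sampled users."""
    total = 0.0
    for v, a, p, g in zip(visits, sampled, probs, predicted):
        total += g
        if a:  # residual correction only for sampled bcookies
            total += (v - g) / p
    return total / len(visits)


probs = sampling_probabilities([0.0, 1.0, 2.0, 5.0])
visits = [10.0, 20.0, 30.0, 40.0]
dr_perfect = dr_mean(visits, [1, 1, 0, 0], probs, predicted=visits)
ips_full = ips_mean(visits, [1, 1, 1, 1], [1.0, 1.0, 1.0, 1.0])
```

When the model's predictions equal the true visit counts the
residuals vanish, so the DR estimate no longer depends on which
users happened to be sampled.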

Fig. 3 summarizes the estimation error of the two methods
with increasing data size. For both IPS and DR, the esti- 
mation error goes down with more data. In terms of rmse, 
Figure 3. Comparison of IPS and DR: rmse (top), bias (bottom).
The ground truth value is 23.8. 



the DR estimator is consistently better than IPS, especially 
when dataset size is smaller. The DR estimator often re- 
duces the rmse by a fraction between 10% and 20%, and 
on average by 13.6%. By comparing to the bias and std 
metrics, it is clear that DR's gain of accuracy came from a 
lower variance, which accelerated convergence of the esti- 
mator to the true value. These results confirm our analysis 
that DR tends to reduce variance provided that a reasonable 
reward estimator is available. 

6. Conclusions 

Doubly robust policy estimation is an effective technique 
which virtually always improves on the widely used inverse 
propensity score method. Our analysis shows that doubly 
robust methods tend to give more reliable and accurate es- 
timates. The theory is corroborated by experiments on both 
benchmark data and a large-scale, real-world problem. 

In the future, we expect the DR technique to become common
practice in improving contextual bandit algorithms. As an
example, it is interesting to develop a variant of Offset Tree
that can take advantage of better reward models, rather than a
crude, constant reward estimate (Beygelzimer & Langford, 2009).
A. Direct Loss Minimization

Given cost-sensitive multiclass classification data
$\{(x, l_1, \ldots, l_k)\}$, we perform approximate gradient
descent on the policy loss (or classification error). In the
experiments of Section 5.1, policy $\pi$ is specified by $k$
weight vectors $\theta_1, \ldots, \theta_k$. Given $x \in X$,
the policy predicts as follows:
$\pi(x) = \operatorname{argmax}_{a \in \{1, \ldots, k\}} \{x \cdot \theta_a\}$.

To optimize $\theta_a$, we adapt the "towards-better" version
of the direct loss minimization method of McAllester et al.
(2011) as follows: given any data $(x, l_1, \ldots, l_k)$ and
the current weights $\theta_a$, the weights are adjusted by
$\theta_{a_1} \leftarrow \theta_{a_1} + \eta x$,
$\theta_{a_2} \leftarrow \theta_{a_2} - \eta x$, where
$a_1 = \operatorname{argmax}_a \{x \cdot \theta_a - \epsilon l_a\}$,
$a_2 = \operatorname{argmax}_a \{x \cdot \theta_a\}$,
$\eta \in (0, 1)$ is a decaying learning rate, and
$\epsilon > 0$ is an input parameter.
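
The single-example update can be sketched as below. This is an
incremental version for illustration only; the function name,
the 0-indexed actions, and the tie-breaking rule (lowest index
wins) are assumptions.

```python
def towards_better_update(theta, x, losses, eta, eps):
    """Apply one update in place; theta[a] is the weight vector of action a."""
    def score(a):
        return sum(w * xi for w, xi in zip(theta[a], x))

    actions = range(len(theta))
    a1 = max(actions, key=lambda a: score(a) - eps * losses[a])  # loss-adjusted argmax
    a2 = max(actions, key=score)                                  # current prediction
    if a1 != a2:  # when they coincide the two adjustments cancel
        theta[a1] = [w + eta * xi for w, xi in zip(theta[a1], x)]
        theta[a2] = [w - eta * xi for w, xi in zip(theta[a2], x)]
    return a1, a2


theta = [[0.0, 0.0], [0.0, 0.0]]
chosen = towards_better_update(theta, x=[1.0, 2.0], losses=[1.0, 0.0],
                               eta=0.5, eps=0.1)
```

In the toy call, action 1 has the lower loss, so its weights
move toward x while the currently predicted action 0 moves away.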

For computational reasons, we actually performed batched
updates rather than incremental updates. We found that the
learning rate $\eta = t^{-0.3}/2$, where $t$ is the batched
iteration, worked well across all datasets. The parameter
$\epsilon$ was fixed to 0.1 for all datasets. Updates continued
until the weights converged.

Furthermore, since the policy loss is not convex in the weight vec- 
tors, we repeated the algorithm 20 times with randomly perturbed 
starting weights and then returned the best run's weight according 
to the learned policy's loss in the training data. We also tried us- 
ing a holdout validation set for choosing the best weights out of 
the 20 candidates, but did not observe benefits from doing so. 

B. Filter Tree 

The Filter Tree (Beygelzimer et al., 2008) is a reduction from
cost-sensitive classification to binary classification. Its
input is of the same form as for Direct Loss Minimization, but
its output is a binary-tree based predictor where each node of
the Filter Tree uses a binary classifier, in this case the J48
decision tree implemented in Weka 3.6.4 (Hall et al., 2009).
Thus, there are 2-class decision trees in the nodes, with the
nodes arranged as per a Filter Tree. Training in a Filter Tree
proceeds bottom-up, with each trained node filtering the
examples observed by its parent until the entire tree is
trained.

Testing proceeds root-to-leaf, implying that the test time compu- 
tation is logarithmic in the number of classes. We did not test the 
all-pairs Filter Tree, which has test time computation linear in the 
class count similar to DLM. 